INTRODUCTION


In the pattern recognition literature, and in the area of discriminant analysis, one
commonly finds likelihood-based approaches to the problem of classifying a
p-dimensional sample x into one class from a population of classes $\pi_i$ ($i = 1, \ldots, m$),
where the classes $\pi_i$ are assumed to be multivariate normally distributed, with separate
means $\mu_i$ and either a common covariance matrix $\Sigma$ or individual covariance matrices
$\Sigma_i$. Fisher’s linear and quadratic discriminant functions are examples of likelihood-based
discriminants,  and  are  described  and  discussed  in  all  pertinent  literature.  A  good 
introduction  is  given  by  Duda  and  Hart  [1973].  Geisser  [1964]  discusses  a  Bayesian 
approach  to  classification  by  deriving  the  predictive  density  for  a  class.  In  the  Bayesian 
approach,  classification  of  a  new  sample  is  based  on  the  ratios  of  the  predictive  densities 
of  the  classes  for  the  data  sample  x.  The  subject  of  this  report  is  the  comparison  of  the 
misclassification  rates  of  predictive  and  likelihood  discriminant  functions  under  the 
constraint  of  small  training  sets  and  a  varying  number  of  training  samples  per  class. 

The  literature  comparing  and  contrasting  the  predictive  and  likelihood-based 
approaches  is  seemingly  contradictory  and  definitely  confusing.  Kendall,  Stuart,  and  Ord 
[1987]  note  that  a  complaint  against  the  linear  discriminant  function  is  that  it  does  not 
take  into  account  the  relative  sizes  of  the  training  sets  of  the  classes.  They  further  state 
that  the  approach  of  predictive  discrimination  yields  more  reliable  estimates  than  an 
equivalent  likelihood  approach.  However,  Raudys  and  Jain  [1991]  assert  that  Bayesian 
density  estimates  (predictive  densities)  do  not  improve  performance  over  the  quadratic 
discriminant  function  when  sample  sizes  are  different,  and  do  not  include  the  predictive 
discriminant  function  in  their  discussion  of  small  training  set  classification. 

Most  literature  acknowledges  that  the  Fisher  linear  and  quadratic  discriminant 
functions  are  asymptotically  optimal  for  Gaussian  population  classes  (see  Anderson 
[1984]). However, since optimality is an asymptotic property, i.e., is true for large samples
of  the  data,  the  functions  are  not  necessarily  optimal  for  small  sample  sizes.  As  a 
consequence,  Enis  and  Geisser  [1974]  claim  that  the  Bayesian-derived  predictive  density 
is  optimal  in  minimizing  the  probability  of  misclassification.  And  asymptotically,  the 
predictive  density  approaches  the  linear  and  quadratic  in  functional  form.  Thus  it  can  be 
viewed  that  the  optimality  (as  a  function  of  sample  size)  of  the  likelihood-based 
discriminant  functions  is  a  function  of  how  fast  the  predictive  and  likelihood  densities 
converge. 

In  many  fields  a  dilemma  exists,  where  classification  of  data  into  a  set  of  classes  is 
desirable,  but  it  is  impossible  or  too  expensive  to  obtain  a  reliable  and  large  training  set. 
Small  training  sets  are  therefore  generally  the  rule  in  these  fields.  Classification  is  thus 
attempted  by  using  a  small  number  of  training  samples  from  each  class.  In  addition,  the 
number  of  training  samples  from  each  class  is  generally  different.  Small  training  samples 
and  differing  numbers  of  training  samples  for  each  class  create  a  problem  for  the 
likelihood-based discriminant functions. Marks and Dunn [1974] analyzed Fisher’s linear,
quadratic, and linear “best” discriminant functions when sample size is small, i.e.,
n = 10 to 100. When sample sizes are moderate (i.e., n = 100 to 500), Wahl and Kronmal
[1977] found that sample size was critical in choosing between the linear and
quadratic discriminant functions, even when the covariance matrices are unequal.
However, both papers excluded an analysis of the properties of the predictive
discriminant function, and this omission forms the basis for this report.

This report investigates the small-sample misclassification rates of traditional
likelihood and predictive procedures. First, the likelihood and predictive approaches to
classification are introduced. Second, Monte Carlo simulations compare the
misclassification rates of the predictive density discriminator and Fisher’s linear and quadratic
discriminant functions. The misclassification rates of these functions are compared in the
univariate case under the assumptions of the classes having (1) the same variance $\sigma^2$ and
(2) differing variances $\sigma_i^2$. Simulations also vary parameters concerning class separation
and sample size. These parameters are (1) the separation between class populations,
(2) the number of training samples for each class, and (3) the total number of training
samples. Third, a Monte Carlo simulation measures misclassification rates in a
multivariate case using Fisher’s Iris data. A conclusion follows the simulation results.

Note  that  the  decision  theory  concept  of  associating  a  cost  of  misclassification  with 
each  class  is  not  pursued  in  this  report.  For  our  analysis  of  the  likelihood  and  predictive 
techniques  of  classification,  the  cost  of  misclassification  is  considered  equal  for  all  classes 
and  is  therefore  not  considered. 
LIKELIHOOD APPROACH


Let the prior probabilities for classes $\pi_1$ and $\pi_2$ be given as $p_1$ and $p_2$, respectively.
For a two-class case with known parameters, i.e., $\pi_1 \sim N(\mu_1, \Sigma_1)$ and
$\pi_2 \sim N(\mu_2, \Sigma_2)$, assign an observation x (a p-dimensional array) to a class in the
following manner:

\[
\text{Assign to class } \pi_1: \quad \frac{L(x \mid \mu_1, \Sigma_1)\, p_1}{L(x \mid \mu_2, \Sigma_2)\, p_2} \ge 1
\]

\[
\text{Assign to class } \pi_2: \quad \frac{L(x \mid \mu_1, \Sigma_1)\, p_1}{L(x \mid \mu_2, \Sigma_2)\, p_2} < 1
\]

$L(x \mid \mu_1, \Sigma_1)$ is the likelihood of observing x from a normal population $N(\mu_1, \Sigma_1)$. If the
number of classes is greater than two, then the classification rule is modified to determine
the most likely class given x:

\[
\text{Assign to class } \pi_k: \quad L(x \mid \mu_k, \Sigma_k)\, p_k > L(x \mid \mu_i, \Sigma_i)\, p_i \quad \forall\, i \ne k
\]

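In the univariate case, we replace the covariance matrix $\Sigma_i$ and mean vector $\mu_i$ with
their respective scalar values $\sigma_i^2$ and $\mu_i$:

\[
\text{Assign to class } \pi_1: \quad \frac{L(x \mid \mu_1, \sigma_1^2)\, p_1}{L(x \mid \mu_2, \sigma_2^2)\, p_2} \ge 1
\]

\[
\text{Assign to class } \pi_2: \quad \frac{L(x \mid \mu_1, \sigma_1^2)\, p_1}{L(x \mid \mu_2, \sigma_2^2)\, p_2} < 1
\]

The multiclass univariate rule is

\[
\text{Assign to class } \pi_k: \quad L(x \mid \mu_k, \sigma_k^2)\, p_k > L(x \mid \mu_i, \sigma_i^2)\, p_i \quad \forall\, i \ne k
\]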
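The multiclass univariate rule can be illustrated with a minimal sketch, assuming the class parameters are known. The function names and the numeric values below are hypothetical and chosen only for illustration; they are not taken from the report's program.

```python
import math

def normal_pdf(x, mu, var):
    """Univariate normal density N(mu, var) evaluated at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def classify_likelihood(x, params, priors):
    """Return the index k that maximizes L(x | mu_k, var_k) * p_k.

    params: list of (mu, var) pairs, one per class (assumed known).
    priors: list of prior probabilities p_k, in the same order.
    """
    scores = [normal_pdf(x, mu, var) * p for (mu, var), p in zip(params, priors)]
    return max(range(len(scores)), key=scores.__getitem__)

# Illustrative values only: two classes N(0, 1) and N(4, 1) with equal priors.
params = [(0.0, 1.0), (4.0, 1.0)]
priors = [0.5, 0.5]
label_low = classify_likelihood(1.0, params, priors)    # nearer the first class
label_high = classify_likelihood(3.5, params, priors)   # nearer the second class
```

The function returns a zero-based class index, so index 0 corresponds to $\pi_1$ and index 1 to $\pi_2$.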

When the parameters are unknown, one replaces the parameters with the “best” (in
some sense) estimates for the parameters. In the multivariate normal case, we use for $\mu_i$
and $\Sigma_i$

\[
\bar{x}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} x_{ij}
\]

\[
(N_i - 1)\, S_i = \sum_{j=1}^{N_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)'
\]

respectively, where $N_i$ is the number of training samples and the $x_{ij}$ correspond to
training samples from class $\pi_i$.
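As a small sketch of these two estimates, the following computes the sample mean vector and the unbiased sample covariance matrix from a list of training vectors. The helper names and the tiny data set are hypothetical.

```python
def sample_mean(samples):
    """Mean vector xbar = (1/N) * sum_j x_j for a list of p-dimensional samples."""
    n = len(samples)
    p = len(samples[0])
    return [sum(x[d] for x in samples) / n for d in range(p)]

def sample_covariance(samples):
    """Sample covariance S defined by (N - 1) S = sum_j (x_j - xbar)(x_j - xbar)'."""
    n = len(samples)
    p = len(samples[0])
    xbar = sample_mean(samples)
    scatter = [[0.0] * p for _ in range(p)]
    for x in samples:
        dev = [x[a] - xbar[a] for a in range(p)]
        for a in range(p):
            for b in range(p):
                scatter[a][b] += dev[a] * dev[b]
    return [[scatter[a][b] / (n - 1) for b in range(p)] for a in range(p)]

# Hypothetical two-dimensional training set with three samples.
train = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
xbar = sample_mean(train)
S = sample_covariance(train)
```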
The linear and quadratic discriminant functions will be derived for the two-class case
with  unknown  parameters.  Note,  however,  that  the  previously  mentioned  modification  of 
the  two-class  case  is  easily  applied  to  these  functions  to  derive  the  multiple-class  decision 
rules  for  the  discriminant  functions. 

FISHER’S  LINEAR  DISCRIMINANT  FUNCTION 

If the covariance matrices (of dimension p x p) are assumed equal but unknown, i.e.,
$\Sigma_1 = \Sigma_2 = \Sigma$, one can improve S, the estimate of $\Sigma$, by using a weighted average of the
sample covariance matrices:

\[
(N_1 + N_2 - 2)\, S = \sum_{i=1}^{N_1} (x_{1i} - \bar{x}_1)(x_{1i} - \bar{x}_1)' + \sum_{j=1}^{N_2} (x_{2j} - \bar{x}_2)(x_{2j} - \bar{x}_2)'
\]

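The likelihood decision rule becomes, with associated prior probabilities $p_1$ and $p_2$,

\[
\text{Assign to class } \pi_1: \quad
\frac{(2\pi)^{-p/2} |S|^{-1/2} \exp\left\{-\tfrac{1}{2}(x - \bar{x}_1)' S^{-1} (x - \bar{x}_1)\right\} p_1}
     {(2\pi)^{-p/2} |S|^{-1/2} \exp\left\{-\tfrac{1}{2}(x - \bar{x}_2)' S^{-1} (x - \bar{x}_2)\right\} p_2} \ge 1
\]

otherwise assign to class $\pi_2$ (ratio is < 1). Cancelling the constants, moving the
prior probabilities to the other side, and taking the logarithm of both sides, the rule
can be written as the following:

\[
\text{Assign to class } \pi_1: \quad
\tfrac{1}{2}(x - \bar{x}_2)' S^{-1} (x - \bar{x}_2) - \tfrac{1}{2}(x - \bar{x}_1)' S^{-1} (x - \bar{x}_1) \ge \ln\frac{p_2}{p_1}
\]

Expanding the left side and reducing, one is left with the linear discriminant function:

\[
\text{Assign to class } \pi_1: \quad
x' S^{-1} (\bar{x}_1 - \bar{x}_2) - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)' S^{-1} (\bar{x}_1 - \bar{x}_2) \ge \ln\frac{p_2}{p_1}
\]

\[
\text{Assign to class } \pi_2: \quad
x' S^{-1} (\bar{x}_1 - \bar{x}_2) - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)' S^{-1} (\bar{x}_1 - \bar{x}_2) < \ln\frac{p_2}{p_1}
\]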

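This equation describes a hyperplane decision surface whose normal vector is
$S^{-1}(\bar{x}_1 - \bar{x}_2)$; it is perpendicular to the vector between the means $\bar{x}_1$ and $\bar{x}_2$
when S is proportional to the identity matrix. For a derivation of these equations, see
Anderson [1984].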

For the univariate case, one replaces vectors with the corresponding scalar values. The
linear discriminant function now becomes

\[
\text{Assign to class } \pi_1: \quad
\frac{x(\bar{x}_1 - \bar{x}_2)}{S} - \frac{1}{2S}(\bar{x}_1 + \bar{x}_2)(\bar{x}_1 - \bar{x}_2) \ge \ln\frac{p_2}{p_1}
\]

\[
\text{Assign to class } \pi_2: \quad
\frac{x(\bar{x}_1 - \bar{x}_2)}{S} - \frac{1}{2S}(\bar{x}_1 + \bar{x}_2)(\bar{x}_1 - \bar{x}_2) < \ln\frac{p_2}{p_1}
\]
If $p_1 = p_2$, then the logarithm of their ratio equals 0, and the univariate linear
discriminant equation can be further reduced to the following:

\[
\text{Assign to class } \pi_1: \quad x(\bar{x}_1 - \bar{x}_2) \ge \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)(\bar{x}_1 - \bar{x}_2)
\]

\[
\text{Assign to class } \pi_2: \quad x(\bar{x}_1 - \bar{x}_2) < \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)(\bar{x}_1 - \bar{x}_2)
\]

This is equivalent to finding on which side observation x lies in relation to the
midpoint of the line segment between the class means $\bar{x}_1$ and $\bar{x}_2$ (the hyperplane is a
point). A graphical example of the univariate linear discriminant function for two classes,
with $\bar{x}_1$ and $\bar{x}_2$ equal to 4.0 and 6.0, is given in figure 1.


Figure  1.  Linear  discriminant  function. 
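The univariate linear rule can be sketched in a few lines. The class means 4.0 and 6.0 match the example of figure 1; the pooled variance of 1.0, the function names, and the test points are assumptions made only for this illustration.

```python
import math

def pooled_variance(sample1, sample2):
    """Pooled variance S with (N1 + N2 - 2) S = sum of squared deviations of both samples."""
    m1 = sum(sample1) / len(sample1)
    m2 = sum(sample2) / len(sample2)
    ss = sum((v - m1) ** 2 for v in sample1) + sum((v - m2) ** 2 for v in sample2)
    return ss / (len(sample1) + len(sample2) - 2)

def linear_rule(x, m1, m2, S, p1=0.5, p2=0.5):
    """Return 1 if x is assigned to class 1 by the univariate linear rule, else 2."""
    lhs = x * (m1 - m2) / S - (m1 + m2) * (m1 - m2) / (2.0 * S)
    return 1 if lhs >= math.log(p2 / p1) else 2

# With equal priors the boundary is the midpoint 5.0 of the means 4.0 and 6.0.
left_of_midpoint = linear_rule(4.9, 4.0, 6.0, 1.0)
right_of_midpoint = linear_rule(5.1, 4.0, 6.0, 1.0)
S_example = pooled_variance([3.0, 5.0], [5.0, 7.0])
```

With unequal priors the threshold shifts away from the midpoint toward the mean of the less probable class.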

FISHER’S  QUADRATIC  DISCRIMINANT  FUNCTION 

In  the  p-dimensional  multivariate  case,  if  the  covariance  matrices  of  the  classes  are 
not  considered  equal,  then  the  decision  surface  is  not  linear  (a  hyperplane),  but  quadratic. 
Hence the term quadratic discriminant function. Consider the case with two classes $\pi_1$
and $\pi_2$, with respective prior probabilities $p_1$ and $p_2$, sample mean vectors $\bar{x}_1$ and $\bar{x}_2$, and
sample covariance matrices $S_1$ and $S_2$. The quadratic discriminant function is based on
the ratio of the likelihood functions given x times the respective prior probability, and can
be initially written (without simplification) as the following rule:

\[
\text{Assign to class } \pi_1: \quad
\frac{(2\pi)^{-p/2} |S_1|^{-1/2} \exp\left\{-\tfrac{1}{2}(x - \bar{x}_1)' S_1^{-1} (x - \bar{x}_1)\right\} p_1}
     {(2\pi)^{-p/2} |S_2|^{-1/2} \exp\left\{-\tfrac{1}{2}(x - \bar{x}_2)' S_2^{-1} (x - \bar{x}_2)\right\} p_2} \ge 1
\]

Otherwise assign to class $\pi_2$.
This function can be written in a simpler form. Taking the logarithm of both sides and
simplifying the equation gives a general form of quadratic discriminant function:

\[
\text{Assign to class } \pi_1: \quad
(x - \bar{x}_2)' S_2^{-1} (x - \bar{x}_2) - (x - \bar{x}_1)' S_1^{-1} (x - \bar{x}_1) \ge \ln\frac{|S_1|}{|S_2|} + 2\ln\frac{p_2}{p_1}
\]

\[
\text{Assign to class } \pi_2: \quad
(x - \bar{x}_2)' S_2^{-1} (x - \bar{x}_2) - (x - \bar{x}_1)' S_1^{-1} (x - \bar{x}_1) < \ln\frac{|S_1|}{|S_2|} + 2\ln\frac{p_2}{p_1}
\]

If the prior probabilities $p_1$ and $p_2$ are equal, the logarithm of their ratio equals 0, and
the quadratic discriminant function can be rewritten as

\[
\text{Assign to class } \pi_1: \quad
(x - \bar{x}_2)' S_2^{-1} (x - \bar{x}_2) - (x - \bar{x}_1)' S_1^{-1} (x - \bar{x}_1) \ge \ln\frac{|S_1|}{|S_2|}
\]

\[
\text{Assign to class } \pi_2: \quad
(x - \bar{x}_2)' S_2^{-1} (x - \bar{x}_2) - (x - \bar{x}_1)' S_1^{-1} (x - \bar{x}_1) < \ln\frac{|S_1|}{|S_2|}
\]

For the univariate two-class case, one again replaces the vectors with scalar values
(with $S_1$ and $S_2$ now denoting the sample variances), and the quadratic function can be
written as

\[
\text{Assign to class } \pi_1: \quad
\frac{(x - \bar{x}_2)^2}{S_2} - \frac{(x - \bar{x}_1)^2}{S_1} \ge \ln\frac{S_1}{S_2}
\]

\[
\text{Assign to class } \pi_2: \quad
\frac{(x - \bar{x}_2)^2}{S_2} - \frac{(x - \bar{x}_1)^2}{S_1} < \ln\frac{S_1}{S_2}
\]

An  example  of  the  decision  regions  created  by  a  univariate  quadratic  discriminant 
function  is  shown  in  figure  2. 




Figure  2.  Quadratic  univariate  example. 
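A brief sketch of the univariate quadratic rule follows. It compares the two weighted normal log-likelihoods directly, which gives the threshold $\ln(S_1/S_2) + 2\ln(p_2/p_1)$. The sample means, variances, and test points are hypothetical values picked to show that the region assigned to the wider class can lie on both sides of the narrower one.

```python
import math

def quadratic_rule(x, m1, S1, m2, S2, p1=0.5, p2=0.5):
    """Return 1 if x is assigned to class 1 by the univariate quadratic rule, else 2.

    m1, m2: sample means; S1, S2: sample variances (both assumed positive).
    """
    lhs = (x - m2) ** 2 / S2 - (x - m1) ** 2 / S1
    rhs = math.log(S1 / S2) + 2.0 * math.log(p2 / p1)
    return 1 if lhs >= rhs else 2

# Narrow class 1 (mean 0, variance 1) against wide class 2 (mean 4, variance 16).
near_class1 = quadratic_rule(0.0, 0.0, 1.0, 4.0, 16.0)
near_class2 = quadratic_rule(4.0, 0.0, 1.0, 4.0, 16.0)
far_left = quadratic_rule(-6.0, 0.0, 1.0, 4.0, 16.0)
```

The point at -6.0 lies on the far side of class 1 from the class 2 mean, yet the wider class claims it, which a single-threshold linear rule cannot reproduce.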
THE PREDICTIVE DENSITY: A BAYESIAN APPROACH


A Bayesian approach to classification is based on computing the predictive density, the
probability density $p(z \mid \bar{x}_i, S_i, \pi_i)$, where z is a future (p-dimensional) observation. The
predictive density replaces the likelihood function in the classification rule. The predictive
density incorporates knowledge about how many sample points are being used to estimate
the density, and therefore the predictive density becomes a function of the sample size $N_i$.
Given the definitions of $\bar{x}_i$ and $S_i$, based on $N_i$ observations, previously defined, a Bayesian
derivation of the probability that the observation z belongs to class $\pi_i$ (i.e.,
$p(\pi_i \mid \bar{x}_i, S_i, z)$ for $i = 1, \ldots, K$) follows. For details on the Bayesian approach, see Press
[1982]. The following derivation is due to Geisser [1964].

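Define the joint prior probability density of $\mu_i$ and $\Sigma_i^{-1}$ as the following:

\[
g(\mu_i, \Sigma_i^{-1})\, d\mu_i\, d\Sigma_i^{-1} \propto |\Sigma_i|^{(p+1)/2}\, d\mu_i\, d\Sigma_i^{-1}
\]

where p is the dimension of $\mu_i$. This joint prior probability is called a reference prior
and indicates no prior knowledge of the distribution. Another constraint is that $p < N_i$,
so the covariance matrix is not singular. Now we have from Geisser [1964, p.72] that
the joint density of $\bar{x}_i$, $S_i$, conditional on the parameters $\mu_i$, $\Sigma_i^{-1}$, and $\pi_i$, has the
following form:

\[
p(\bar{x}_i, S_i \mid \mu_i, \Sigma_i^{-1}, \pi_i) \propto |S_i|^{(N_i - 2 - p)/2}\, |\Sigma_i|^{-N_i/2}
\exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\, \Sigma_i^{-1}\left[(N_i - 1) S_i + N_i(\bar{x}_i - \mu_i)(\bar{x}_i - \mu_i)'\right]\right\}
\]

Multiplying $p(\bar{x}_i, S_i \mid \mu_i, \Sigma_i^{-1}, \pi_i)$ by the joint prior distribution gives

\[
p(\mu_i, \Sigma_i^{-1} \mid \bar{x}_i, S_i, \pi_i) \propto |\Sigma_i^{-1}|^{(N_i - p - 1)/2}
\exp\left\{-\tfrac{1}{2}\,\mathrm{tr}\, \Sigma_i^{-1}\left[(N_i - 1) S_i + N_i(\bar{x}_i - \mu_i)(\bar{x}_i - \mu_i)'\right]\right\}
\]

Integrating over the parameters $\mu_i$ and $\Sigma_i^{-1}$, the predictive density for an observation z
is obtained:

\[
p(z \mid \bar{x}_i, S_i, \pi_i) = \int\!\!\int p(z \mid \mu_i, \Sigma_i^{-1}, \pi_i)\, p(\mu_i, \Sigma_i^{-1} \mid \bar{x}_i, S_i, \pi_i)\, d\mu_i\, d\Sigma_i^{-1}
\]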

Classification of an observation z into a class $\pi_i$ is according to

\[
p(\pi_i \mid z, \bar{x}_i, S_i) \propto p(z \mid \bar{x}_i, S_i, \pi_i)\, p_i
\]

where $p_i$ is the prior probability of class $\pi_i$. Therefore, the two-class decision rule is
the following:

\[
\text{Assign to class } \pi_1: \quad \frac{p(\pi_1 \mid z, \bar{x}_1, S_1)}{p(\pi_2 \mid z, \bar{x}_2, S_2)} \ge 1
\]

\[
\text{Assign to class } \pi_2: \quad \frac{p(\pi_1 \mid z, \bar{x}_1, S_1)}{p(\pi_2 \mid z, \bar{x}_2, S_2)} < 1
\]


This is the predictive-odds ratio for classifying z into $\pi_1$ as compared with $\pi_2$.
Expanding the decision rule with the predictive densities and simplifying gives the
following form, in which the observation to be classified is written x:

\[
\text{Assign to class } \pi_1: \quad
K_{12}\,
\frac{\left[1 + \dfrac{N_2}{N_2^2 - 1}(x - \bar{x}_2)' S_2^{-1} (x - \bar{x}_2)\right]^{N_2/2}}
     {\left[1 + \dfrac{N_1}{N_1^2 - 1}(x - \bar{x}_1)' S_1^{-1} (x - \bar{x}_1)\right]^{N_1/2}} \ge 1
\]

otherwise class $\pi_2$, where $K_{12}$ is a constant not depending on x.
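For orientation, the predictive density under a reference prior of this kind is commonly quoted in the multivariate Student t form below. The explicit normalizing constant is not written out in the text above, so the expressions here should be read as the standard result, stated as an assumption, rather than as a transcription of the report's own formula.

```latex
\[
p(z \mid \bar{x}_i, S_i, \pi_i) =
\left(\frac{N_i}{N_i + 1}\right)^{p/2}
\frac{\Gamma\!\left(\frac{N_i}{2}\right)}
     {\pi^{p/2}\, \Gamma\!\left(\frac{N_i - p}{2}\right) \left|(N_i - 1) S_i\right|^{1/2}}
\left[1 + \frac{N_i}{N_i^2 - 1}(z - \bar{x}_i)' S_i^{-1} (z - \bar{x}_i)\right]^{-N_i/2}
\]

\[
K_{12} = \frac{p_1}{p_2} \cdot
\frac{\left(\frac{N_1}{N_1 + 1}\right)^{p/2} \Gamma\!\left(\frac{N_1}{2}\right)
      \Gamma\!\left(\frac{N_2 - p}{2}\right) \left|(N_2 - 1) S_2\right|^{1/2}}
     {\left(\frac{N_2}{N_2 + 1}\right)^{p/2} \Gamma\!\left(\frac{N_2}{2}\right)
      \Gamma\!\left(\frac{N_1 - p}{2}\right) \left|(N_1 - 1) S_1\right|^{1/2}}
\]
```

Under this assumed form, the bracketed factors of the two densities give the ratio in the decision rule, and everything that does not involve the observation collects into $K_{12}$.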


The predictive discriminant function is the ratio of two multivariate Student
t-distributions. This gives the function several nice properties. One, the sample sizes need not be
the same for the function to be applicable, as is implicit in the likelihood-based
approaches. Two, since the discriminant function is a quadratic function, the covariance
matrices need not be equal, which is required (or assumed) by Fisher’s linear
discriminant function. Figure 3 is an example of the predictive distributions generated from
two t-distributions with equal sample variances but unequal numbers of samples per class.


Figure 3. Univariate predictive distributions
for N(0,1) with N = 3 vs. N(4,1) with N = 6.


The extension of the two-class case to a multiclass decision rule is simply

\[
\text{Assign to class } \pi_k: \quad p(\pi_k \mid z, \bar{x}_k, S_k) > p(\pi_i \mid z, \bar{x}_i, S_i) \quad \forall\, i \ne k
\]
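For the univariate case, the decision rule is simplified to the following:

\[
\text{Assign to class } \pi_1: \quad
K_{12}\,
\frac{\left[1 + \dfrac{N_2}{N_2^2 - 1}\, \dfrac{(x - \bar{x}_2)^2}{S_2}\right]^{N_2/2}}
     {\left[1 + \dfrac{N_1}{N_1^2 - 1}\, \dfrac{(x - \bar{x}_1)^2}{S_1}\right]^{N_1/2}} \ge 1
\]

where $K_{12}$ is the univariate (p = 1) counterpart of the constant above and does not
depend on x.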
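The univariate predictive rule can be sketched as follows. The normalizing constant used in `log_predictive` is the standard univariate Student t constant (N - 1 degrees of freedom, scale $\sqrt{S(N+1)/N}$); it is an assumption here, as are the function names. The example values follow the configuration of figure 3.

```python
import math

def log_predictive(z, xbar, S, N):
    """Log of the univariate Student t predictive density (assumed standard form).

    xbar, S: sample mean and unbiased sample variance from N training points (N >= 2).
    """
    log_const = (0.5 * math.log(N / (N + 1.0))
                 + math.lgamma(N / 2.0)
                 - math.lgamma((N - 1) / 2.0)
                 - 0.5 * math.log(math.pi * (N - 1) * S))
    bracket = 1.0 + N / (N * N - 1.0) * (z - xbar) ** 2 / S
    return log_const - 0.5 * N * math.log(bracket)

def predictive_rule(z, xbar1, S1, N1, xbar2, S2, N2, p1=0.5, p2=0.5):
    """Return 1 if the predictive-odds ratio for class 1 is at least 1, else 2."""
    log_odds = (math.log(p1) + log_predictive(z, xbar1, S1, N1)
                - math.log(p2) - log_predictive(z, xbar2, S2, N2))
    return 1 if log_odds >= 0.0 else 2

# Class 1: mean 0, variance 1, N = 3; class 2: mean 4, variance 1, N = 6.
at_class1_mean = predictive_rule(0.0, 0.0, 1.0, 3, 4.0, 1.0, 6)
at_class2_mean = predictive_rule(4.0, 0.0, 1.0, 3, 4.0, 1.0, 6)
```

Working on the log scale avoids overflow in the gamma functions and keeps the rule a simple comparison of two log densities plus the log prior ratio.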


The multivariate (univariate) t-distribution incorporates the information of the
variability of the mean and variance attributable to the number of training samples. It is
asymptotically normal, and therefore one could view the linear and quadratic functions as
asymptotically approaching a t-distribution as n gets large (rather than the converse). Note
that the number of points needed for the multivariate likelihood and predictive
discriminant functions is a function of the dimensionality of the data. The higher the
dimensionality, the greater the number of training samples needed for a reasonable and
reliable estimation of the parameters. The relationship of class training sample size to
misclassification rates, and therefore discriminant performance, will be investigated in the
next section.
NUMERICAL SIMULATION AND ANALYSIS

To  understand  the  relationship  and  capabilities  of  the  discriminant  functions  discussed 
in  the  previous  sections,  Monte  Carlo  simulations  have  been  made  to  measure  the  effects 
of  modifying  (1)  the  parameter  values  of  the  underlying  class  populations,  and  (2)  the 
number of training samples for each class, on the misclassification rates of the
discriminant  functions.  A  Monte  Carlo  analysis  of  the  relationship  between  the  training 
sample  size  and  misclassification  rates  of  the  discriminant  functions  is  also  performed  on 
a  multivariate  data  set,  i.e.,  Fisher’s  Iris  data. 

UNIVARIATE  TWO-CLASS  CASE 

The analysis of the univariate two-class case was performed as follows: The
experimental parameters of population class variance and training sample size were set at
various values, and the probabilities of misclassification were estimated through Monte
Carlo simulation. The simulations were performed according to the following algorithm:

1. For each class, the population parameter values, and the number of training
samples, were sent via the argument list to the program.

2. The program generated a training set (of the specified number) as well as a test
set of random numbers (currently 1000). Half of these test cases were
generated from a normal distribution with the class $\pi_1$ specified parameters (i.e., mean
and variance). The other half of the test cases were generated from a normal
distribution with the class $\pi_2$ specified parameters.

3. Each test point (from both classes) was classified to belong to class $\pi_1$ or class
$\pi_2$ by the discriminant functions. A count of the misclassifications was kept
and the results were printed.

4. Steps 2 and 3 were repeated a number of times (300 in the cases presented
below), generating misclassification samples for the discriminant functions, as
a function of the parameter values passed to the program. The average of these
misclassification samples was output as the result.
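The four steps above can be sketched as a short simulation. This is a minimal illustration under stated assumptions, not the report's program: the function names, the seed handling, and the Student t constant in `log_predictive` are assumptions, and equal priors are used throughout.

```python
import math
import random

def mean_var(xs):
    """Sample mean and unbiased sample variance of a list of floats."""
    n = len(xs)
    m = sum(xs) / n
    return m, sum((v - m) ** 2 for v in xs) / (n - 1)

def linear_rule(x, s1, n1, s2, n2):
    (m1, v1), (m2, v2) = s1, s2
    S = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)  # pooled variance
    return 1 if x * (m1 - m2) / S >= (m1 + m2) * (m1 - m2) / (2.0 * S) else 2

def quadratic_rule(x, s1, n1, s2, n2):
    (m1, v1), (m2, v2) = s1, s2
    return 1 if (x - m2) ** 2 / v2 - (x - m1) ** 2 / v1 >= math.log(v1 / v2) else 2

def log_predictive(x, m, v, n):
    """Log Student t predictive density (standard form, assumed here)."""
    const = (0.5 * math.log(n / (n + 1.0)) + math.lgamma(n / 2.0)
             - math.lgamma((n - 1) / 2.0) - 0.5 * math.log(math.pi * (n - 1) * v))
    return const - 0.5 * n * math.log(1.0 + n / (n * n - 1.0) * (x - m) ** 2 / v)

def predictive_rule(x, s1, n1, s2, n2):
    return 1 if log_predictive(x, *s1, n1) >= log_predictive(x, *s2, n2) else 2

def simulate(rule, mu1, sd1, n1, mu2, sd2, n2, n_test=1000, n_reps=300, seed=0):
    """Average misclassification rate of `rule` over n_reps training/test draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        # Step 2: draw a training set per class, then fit the sample statistics.
        s1 = mean_var([rng.gauss(mu1, sd1) for _ in range(n1)])
        s2 = mean_var([rng.gauss(mu2, sd2) for _ in range(n2)])
        errors = 0
        for i in range(n_test):
            # Half of the test points come from each class.
            true = 1 if i < n_test // 2 else 2
            x = rng.gauss(mu1, sd1) if true == 1 else rng.gauss(mu2, sd2)
            # Step 3: classify and count misclassifications.
            if rule(x, s1, n1, s2, n2) != true:
                errors += 1
        total += errors / n_test
    # Step 4: average over the repetitions.
    return total / n_reps

if __name__ == "__main__":
    for name, rule in [("linear", linear_rule), ("quad", quadratic_rule),
                       ("pred", predictive_rule)]:
        print(name, round(simulate(rule, 0.0, 1.0, 6, 4.0, 2.0, 6), 4))
```

Fitting the statistics once per training set, rather than once per test point, keeps the inner loop cheap; each rule receives the same `(mean, variance)` pairs and sample sizes so the three functions are compared on identical data when the same seed is used.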

MISCLASSIFICATION  RATES  AS  A  FUNCTION 
OF  TRAINING  SIZE  AND  RATIO  OF  VARIANCES 

To compare the capabilities of the three discriminant functions previously discussed in
the small training sample problem, the average misclassification rates of the discriminant
functions were measured and compared over several different training sample cases. For
each case of training sample sizes, the variance of class $\pi_2$ was varied over a range of
values, and the effect of the variance ratio between the classes on the discriminant
misclassification rate was analyzed. In all cases, the prior probabilities of the two classes
were considered equal. The details of the analysis follow.

Class $\pi_1$ was defined to have a normal (0,1) distribution. Class $\pi_2$ had a normal
(4, $\sigma^2$) distribution, where $\sigma^2$ was varied from 1/256 to 256. Figure 4 graphically details
several examples of the class $\pi_1$ and $\pi_2$ configurations. Since the “distance” between two
univariate populations is, in the Mahalanobis sense, a function of the variances of the
populations, the distance between the classes was modified despite the fact that the
centers of the distributions were kept constant throughout the analysis. By allowing the
variance of class $\pi_2$ to vary in value, the distance between the two classes can be viewed
as a function of the class $\pi_2$ variance.


Figure 4. Class $\pi_1$ and class $\pi_2$ populations, showing
several $\pi_2$ distributions with different variances.


The  simulations  were  performed  for  several  cases  of  equal  and  unequal  training 
samples  (figures  5-14  and  tables  1-10).  In  case  1,  both  classes  had  an  equal  number  of 
training  samples.  Case  2  was  an  example  of  unequal  samples  from  the  classes.  Case  3 
reversed  the  unequal  number  of  samples  from  each  class,  to  see  if  the  discriminant 
functions  were  biased  by  sample  size.  Cases  4,  5,  and  6  provided  more  information  about 
various  sample  size  configurations  and  the  average  misclassification  rates  for  the 
discriminant  functions.  Cases  7  and  8  investigated  the  misclassification  rates  when 
distance  between  the  means  of  the  classes  was  increased  and  decreased,  respectively. 
Cases  9  and  10  explored  the  variation  of  misclassification  rates  caused  by  varying  the 
sample  size  of  one  of  the  distributions. 
An explanation of figures 5-14 is in order. The plots use a logarithmic scale,
$\ln(\sigma_2/\sigma_1)$, based on the ratio of the standard deviations of the class populations.
This linearizes the ratio of the standard deviations, moving the case of equal standard
deviations to the point 0, and places reciprocal values of the ratio at equal distances on
either side of 0.
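A short computation illustrates the effect of this scale. The ratios below are the column values used in the tables that follow; the variable names are hypothetical.

```python
import math

# Ratios of standard deviations used as table columns in this section.
ratios = [1 / 16, 1 / 4, 1 / 2, 2 / 3, 1, 1.5, 2, 4, 16]

# Position of each ratio on the logarithmic plot scale.
positions = [math.log(r) for r in ratios]

# Reciprocal ratios (for example 1/16 and 16) land symmetrically about 0,
# and the equal-standard-deviation case (ratio 1) sits exactly at 0.
symmetric_pairs = [(positions[i], positions[-1 - i]) for i in range(4)]
```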

Case 1. Equal number of training samples for each class.


Figure 5. Probability of misclassification for various class $\pi_2$
variances, given an equal number of training samples $N_1 = N_2 = 6$.
(Horizontal axis: $\ln[\sigma_2/\sigma_1]$.)

Table 1. Average misclassification rates for discriminant functions, given
various ratios of the population standard deviations $\sigma_2/\sigma_1$, with $N_1 = N_2 = 6$.

Disc.     Ratios of Standard Deviations
Func.
linear
quad
pred
Case 2. Unequal number of training samples for each class.


Figure 6. Probability of misclassification for various class $\pi_2$
variances, given an unequal number of training samples $N_1 = 6$, $N_2 = 18$.


Table 2. Misclassification rates for discriminant functions, given various
ratios of the population standard deviations $\sigma_2/\sigma_1$, with $N_1 = 6$, $N_2 = 18$.

Disc.     Ratios of Standard Deviations
Func.     1/16     1/4      1/2      2/3      1        1.5      2        4        16
linear    0.0123   0.0119   0.0124   0.0148   0.0259   0.0610   0.0939   0.1750   0.2857
quad      0.0003   0.0019   0.0077   0.0148   0.0297   0.0671   0.1012   0.1646   0.0889
pred      0.0001   0.0014   0.0061   0.0125   0.0279   0.0622   0.0945   0.1603   0.0814

These  results  are  fairly  comparable  to  case  1,  with  the  increased  number  of  training 
samples  from  class  2  lowering  the  misclassification  rates  of  all  the  discriminant  functions. 




Case 3. Unequal number of training samples, with the numbers exchanged from case 2.


Figure 7. Probability of misclassification for various class $\pi_2$
variances, given an unequal number of training samples $N_1 = 18$, $N_2 = 6$.


Table 3. Misclassification rates for discriminant functions, given various
ratios of the population standard deviations $\sigma_2/\sigma_1$, with $N_1 = 18$, $N_2 = 6$.

Disc.     Ratios of Standard Deviations
Func.     1/16     1/4      1/2      2/3      1        1.5      2        4        16
linear    0.0118   0.0118   0.0121   0.0128   0.0254   0.0621   0.0981   0.1899   0.3092
quad      0.0030   0.0049   0.0088   0.0170   0.0315   0.0612   0.0926   0.15143  0.0736
pred      0.0002   0.0017   0.0064   0.0127   0.0281   0.0599   0.0917   0.1499   0.0726

The results are comparable to case 2. Note that the quadratic function
misclassification rate is an order of magnitude larger than its value in case 2 when the ratio of
standard deviations is 1/16.
Case 4. Differing number of training samples.


Figure 8. Probability of misclassification for various class $\pi_2$
variances, given an unequal number of training samples $N_1 = 6$, $N_2 = 3$.


Table 4. Average misclassification rates for discriminant functions for
various ratios of the population standard deviations $\sigma_2/\sigma_1$, with $N_1 = 6$, $N_2 = 3$.

Disc.     Ratios of Standard Deviations
Func.     1/16     1/4      1/2      2/3      1        1.5      2        4        16
linear    0.0141   0.0146   0.0146   0.0149   0.0296   0.0685   0.1101   0.2151   0.3162
quad      0.0227   0.0279   0.0416   0.0460   0.0625   0.0916   0.1211   0.1786   0.0958
pred      0.0013   0.0075   0.0164   0.0228   0.0400   0.0746   0.1082   0.1697   0.0871
Case 5. Exchanged number of training samples from case 4.


Figure 9. Probability of misclassification for various class $\pi_2$
variances, given an unequal number of training samples $N_1 = 3$, $N_2 = 6$.


Table 5. Average misclassification rates for discriminant functions for
various ratios of the population standard deviations $\sigma_2/\sigma_1$, with $N_1 = 3$, $N_2 = 6$.

Disc.     Ratios of Standard Deviations
Func.     1/16     1/4      1/2      2/3      1        1.5      2        4        16
linear    0.0141   0.0146   0.0146   0.0294   0.0294   0.0659   0.1018   0.1952   0.3051
quad      0.0042   0.0088   0.0256   0.0394   0.0629   0.1043   0.1407   0.2062   0.1228
pred      0.0004   0.0037   0.0128   0.0210   0.0437   0.0795   0.1164   0.1877   0.1033
Case 6. Extremely small sample size for class $\pi_2$.


Figure 10. Probability of misclassification for various class $\pi_2$
variances, given an unequal number of training samples $N_1 = 10$, $N_2 = 2$.


Table 6. Average misclassification rates for discriminant functions for
various ratios of the population standard deviations $\sigma_2/\sigma_1$, with $N_1 = 10$, $N_2 = 2$.

Disc.     Ratios of Standard Deviations
Func.     1/16     1/4      1/2      2/3      1        1.5      2        4        16
linear    0.0117   0.0116   0.0128   0.0155   0.0306   0.0695   0.1102   0.2320   0.3183
quad      0.0718   0.0765   0.0962   0.0956   0.0989   0.1197   0.1494   0.1951   0.1070
pred      0.0023   0.0076   0.0143   0.0224   0.0384   0.0739   0.1088   0.1651   0.0827

The  behavior  of  the  quadratic  discriminant  function  in  case  4  and  the  above  case 
clearly  shows  an  inability  to  handle  small  sample  sizes  even  when  the  distributions  are 
quite  separated.  Indeed,  the  quadratic  performance  is  worse  than  the  linear  in  these  cases, 
even  when  the  variances  are  quite  different.  The  predictive  distribution  does  not  exhibit 
the same problem, but handles small sample sizes well across all values of the class $\pi_2$
standard  deviations. 


Case 7. Distance between means of classes is increased.


Figure 11. Average misclassification rates when distance between
means is increased, $N_1 = N_2 = 6$.


Table 7. Average misclassification rates when distance between means
is increased, for various ratios of the population standard deviations
$\sigma_2/\sigma_1$, with $N_1 = N_2 = 6$.

Disc.     Ratios of Standard Deviations
Func.     1/16     1/4      1/2      2/3      1        1.5      2        4        16
linear    0.0008   0.0008   0.0008   0.0009   0.0019   0.0143   0.0375   0.1189   0.2853
quad      0.0004   0.0006   0.0020   0.0032   0.0094   0.0198   0.0386   0.1062   0.0828
pred      0.0000   0.0001   0.0009   0.0018   0.0064   0.0160   0.0330   0.1002   0.0757
Case 8. Distance between means is decreased, with equal sample sizes.


Figure 12. Average misclassification rates when distance between
means is decreased, $N_1 = N_2 = 6$.

Table 8. Average misclassification rates when distance between means
is decreased, for various ratios of the population standard deviations
$\sigma_2/\sigma_1$, with $N_1 = N_2 = 6$.

Disc.     Ratios of Standard Deviations
Func.     1/16     1/4      1/2      2/3      1        1.5      2        4        16
linear    0.0809   0.0822   0.0995   0.1223   0.1674   0.2193   0.2565   0.3356   0.3318
quad      0.0256   0.0604   0.1017   0.1355   0.1816   0.2330   0.2572   0.2201   0.0920
pred      0.0185   0.0541   0.0978   0.1296   0.1770   0.2279   0.2522   0.2175   0.0839
Case 9. Misclassification rate as a function of class $\pi_2$ training sample size,
$\sigma_2/\sigma_1 = 1/16$.


Figure 13. Misclassification rates as a function of class $\pi_2$
training sample size, with $\sigma_2/\sigma_1 = 1/16$.


Table 9. Misclassification rates for various ratios of $N_2$,
given $\sigma_2/\sigma_1 = 1/16$, $N_1 = 15$, $\mu_1 = 0$, $\mu_2 = 2$.

Disc.     Ratios of Standard Deviations
Func.     1/16     1/4      1/2      2/3      1        1.5      2        4        16
linear    0.0806   0.0804   0.0821   0.0802   0.0801   0.0798   0.0815   0.0801   0.0809
quad      0.1194   0.0626   0.0427   0.0321   0.0249   0.0200   0.0172   0.0155   0.0147
pred      0.0363   0.0259   0.0206   0.0186   0.0168   0.0156   0.0148   0.0143   0.0138

As shown before and seen again here, the quadratic discriminant function appears to
suffer from unstable parameter estimates at low sample sizes, where its misclassification
rate is worse than that of the linear function even though the variances are very different.
The predictive function does not show this problem and is the best discriminant function
for this case.
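
The comparison in this case can be sketched as a small Monte Carlo experiment. The sketch below is illustrative only: it assumes equal prior probabilities, takes $\sigma_1 = 1$, and uses the standard Student t predictive density under a noninformative prior, which may differ in detail from the form derived earlier in this report. The function names (`class_scores`, `error_rate`) and the choice of three class-2 training samples are hypothetical.

```python
import math
import random
import statistics


def normal_logpdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def student_t_logpdf(x, df, loc, scale):
    """Log density of a location-scale Student t distribution."""
    z = (x - loc) / scale
    return (math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
            - 0.5 * math.log(df * math.pi) - math.log(scale)
            - (df + 1) / 2.0 * math.log1p(z * z / df))


def class_scores(x, train1, train2, rule):
    """Return the log scores of x under class 1 and class 2 for one rule."""
    n1, n2 = len(train1), len(train2)
    m1, m2 = statistics.fmean(train1), statistics.fmean(train2)
    v1, v2 = statistics.variance(train1), statistics.variance(train2)
    if rule == "linear":  # plug-in normal densities with a pooled variance
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        return normal_logpdf(x, m1, pooled), normal_logpdf(x, m2, pooled)
    if rule == "quad":  # plug-in normal densities with separate variances
        return normal_logpdf(x, m1, v1), normal_logpdf(x, m2, v2)
    if rule == "pred":  # assumed Student t predictive densities
        return (student_t_logpdf(x, n1 - 1, m1, math.sqrt(v1 * (1 + 1 / n1))),
                student_t_logpdf(x, n2 - 1, m2, math.sqrt(v2 * (1 + 1 / n2))))
    raise ValueError(f"unknown rule: {rule}")


def error_rate(rule, n1, n2, mu1, sd1, mu2, sd2, trials=2000, seed=0):
    """Estimate the misclassification rate over fresh training sets."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        train1 = [rng.gauss(mu1, sd1) for _ in range(n1)]
        train2 = [rng.gauss(mu2, sd2) for _ in range(n2)]
        truth = rng.choice((1, 2))  # equal priors (assumption)
        x = rng.gauss(mu1, sd1) if truth == 1 else rng.gauss(mu2, sd2)
        s1, s2 = class_scores(x, train1, train2, rule)
        guess = 1 if s1 >= s2 else 2
        errors += guess != truth
    return errors / trials


if __name__ == "__main__":
    for rule in ("linear", "quad", "pred"):
        rate = error_rate(rule, n1=15, n2=3, mu1=0.0, sd1=1.0, mu2=2.0, sd2=1.0 / 16)
        print(rule, rate)
```

Because the training sets are redrawn on every trial, the estimate averages over the instability of the variance estimates, which is the effect under discussion. The numbers it prints are not expected to reproduce the table values exactly.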
Case 10. Misclassification rate as a function of class $\pi_2$ training sample size,
$\sigma_2/\sigma_1 = 1$.

Figure 14. Misclassification rates as a function of class $\pi_2$
training sample size, with $\sigma_2/\sigma_1 = 1$.


Table 10. Misclassification rates for various ratios of $N_2$,
given $\sigma_2/\sigma_1 = 1$, $N_1 = 15$, $\mu_1 = 0$, $\mu_2 = 2$.

Disc.    Ratios of Standard Deviations
Func.    1/16    1/4     1/2     2/3     1       1.5     2       4       16
linear   0.1782  0.1691  0.1683  0.1656  0.1647  0.1636  0.1638  0.1637  0.1625
quad     0.2377  0.1976  0.1839  0.1772  0.1724  0.1696  0.1664  0.1662  0.1635
pred     0.2017  0.1793  0.1746  0.1700  0.1685  0.1680  0.1657  0.1655  0.1634

When the variances are equal, the misclassification rates of the predictive and
quadratic discriminant functions are marginally larger than that of the linear function,
which should be the clear winner. Note that the predictive and quadratic functions quickly
converge to the linear function in performance.
MULTIVARIATE CASE - FISHER’S IRIS DATA

An  analysis  was  performed  of  the  misclassification  rates  of  the  discriminant  functions 
on  a  real  multivariate  data  set  to  examine  the  effect  of  small  sample  sizes  on  the 
discriminant  functions  in  a  multivariate  environment. 

The  data  consisted  of  three  classes  of  Iris  flowers.  This  was  the  data  set  used  by 
Fisher  in  analyzing  the  linear  discriminant  function.  The  parameters  measured  were  petal 
width,  petal  length,  sepal  width,  and  sepal  length.  The  data  consisted  of  50  measurements 
from  each  class.  Figures  15,  16,  and  17  are  projections  of  the  data  for  various  features. 
Figure 15. Fisher Iris data: petal length vs. petal width.

Figure 16. Fisher Iris data: sepal length vs. sepal width.

Legend: ° I. Versicolor, * I. Setosa, • I. Virginica

Figure 17. Fisher Iris data: sepal length vs. petal width.

Note that I. Setosa is well separated from the other two species. Misclassifications
will therefore occur mostly between I. Versicolor and I. Virginica. For this reason, the
simulations examine misclassification rates when the training sample sizes of
I. Versicolor and I. Virginica are varied. Another reason is that the petal width
of I. Setosa does not vary much, so small samples drawn from this class quite often
produce a degenerate covariance matrix. This is because the measurements
are integers, and a random set of observations may share the same value for a feature,
thereby creating a singular covariance matrix. Indeed, the integer nature of all the
observations precludes very small sample size comparisons (if the samples are to be
independently selected).
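
The singularity described above is easy to reproduce on a toy example. The sketch below uses hypothetical integer measurements (they are not rows from the Iris data) in two features and shows that the determinant of the sample covariance matrix is exactly zero when every training sample shares the same value for one feature. The helper names are illustrative.

```python
def covariance_2d(points):
    """Unbiased sample covariance matrix of a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def determinant_2d(matrix):
    """Determinant of a 2 x 2 matrix given as nested lists."""
    return matrix[0][0] * matrix[1][1] - matrix[0][1] * matrix[1][0]


# Hypothetical (petal length, petal width) pairs; the second feature is tied.
tied_sample = [(14, 2), (15, 2), (13, 2), (16, 2)]
# The same lengths with some variation in the second feature.
varied_sample = [(14, 2), (15, 3), (13, 2), (16, 4)]

tied_det = determinant_2d(covariance_2d(tied_sample))
varied_det = determinant_2d(covariance_2d(varied_sample))
print(tied_det, varied_det)
```

A zero determinant means the covariance matrix cannot be inverted, so neither the quadratic nor the predictive discriminant function can be evaluated for that training set.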

The  analysis  varied  the  training  sample  sizes  from  the  classes  of  data,  and  the 
probabilities  of  misclassification  were  derived  through  Monte  Carlo  simulation.  The 
simulations  were  performed  as  follows: 

1. The training sample sizes for the three classes were read from the command
line. The observations were also read into the program from a file. Since it is
“bad” practice to test on an observation used for training, the observations for
each class were randomly divided into a training and a testing set. If an
observation was used in training, it was not used to test the discriminant
functions.

2. For each class, a uniform random number generator generated a set of real
numbers that were converted to integers in the range of the number of
observations for the class. Since duplicate numbers can be generated, the
process was repeated for duplicate numbers until the members in the set were
unique. These numbers were used as indexes into the array of observations,
indicating the observations to be used as the training set.

3. The observations corresponding to the index values were used as the training
set, and a sample mean and covariance were generated. The observations not
indexed were used as the testing set.

4. The discriminant functions were tested with the test set, and the
misclassification rate was computed.

5. Steps 2 through 4 were repeated for a number of iterations, and the averaged
results were reported.
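
The procedure in steps 1 through 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the program used for the report: `random.sample` is swapped in for the redraw-on-duplicate index generation of step 2 (it also yields a set of unique indices), and a nearest-class-mean rule on one-dimensional toy data stands in for the three discriminant functions. All function names are hypothetical.

```python
import random


def split_indices(n_obs, n_train, rng):
    """Pick n_train unique training indices; the rest form the test set."""
    train = set(rng.sample(range(n_obs), n_train))
    test = [i for i in range(n_obs) if i not in train]
    return sorted(train), test


def average_error(classes, train_sizes, fit, iterations, seed=0):
    """Average misclassification rate over repeated random train/test splits.

    classes     -- one list of observations per class
    train_sizes -- number of training observations to draw from each class
    fit         -- function mapping the training sets to a predict(x) function
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iterations):
        training, testing = [], []
        for observations, n_train in zip(classes, train_sizes):
            train_idx, test_idx = split_indices(len(observations), n_train, rng)
            training.append([observations[i] for i in train_idx])
            testing.append([observations[i] for i in test_idx])
        predict = fit(training)
        wrong = sum(predict(x) != label
                    for label, test_set in enumerate(testing)
                    for x in test_set)
        total += wrong / sum(len(test_set) for test_set in testing)
    return total / iterations


def nearest_mean_fit(training):
    """Stand-in classifier: assign x to the class with the closest sample mean."""
    means = [sum(t) / len(t) for t in training]

    def predict(x):
        return min(range(len(means)), key=lambda k: abs(x - means[k]))

    return predict


if __name__ == "__main__":
    toy_classes = [[0, 1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9]]  # made-up data
    print(average_error(toy_classes, [3, 3], nearest_mean_fit, iterations=100))
```

Passing the classifier in as `fit` keeps the resampling loop independent of the discriminant function being evaluated, so the same splits could be reused for the linear, quadratic, and predictive rules.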

MISCLASSIFICATION  RATES  AS  A  FUNCTION  OF  TRAINING  SAMPLE 
SIZE 

To measure the misclassification rates, the training sample sizes of I. Versicolor and
I. Virginica were varied; the results are shown below. Misclassification rates were
measured when (1) one class’s training sample size was varied while the others remained
constant, and (2) two classes’ training sample sizes were varied concurrently. The results
are shown in figures 18 and 19, with corresponding tables 11 and 12.

Note that the projections of the covariances of the three classes of plant are quite
similar, and therefore one should expect the linear discriminant function to
outperform the quadratic and predictive discriminant functions for this data set. Also, due
to the discrete nature of the data, a fairly large set (> 10) of class $\pi_1$ (I. Setosa) training
samples was required during the simulations. This is because the petal width
of I. Setosa is predominantly the value 2, so a small random sample of the
class would often produce a singular covariance matrix (all the training samples would
have a petal width of 2). The large set of I. Setosa samples helps the linear discriminant
function stabilize the pooled covariance matrix and thereby improves the linear function’s
performance.
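
The stabilizing effect of pooling can be illustrated with a short sketch. It assumes the usual pooled estimator, in which each class's scatter matrix is summed and the total is divided by the combined degrees of freedom; the report's own estimator may differ in detail. The data and function names are hypothetical.

```python
def scatter_matrix(rows):
    """Sum of outer products of deviations from the class mean."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows)
             for j in range(p)] for i in range(p)]


def pooled_covariance(class_samples):
    """Pooled covariance: total scatter divided by total degrees of freedom."""
    p = len(class_samples[0][0])
    total = [[0.0] * p for _ in range(p)]
    dof = 0
    for rows in class_samples:
        s = scatter_matrix(rows)
        for i in range(p):
            for j in range(p):
                total[i][j] += s[i][j]
        dof += len(rows) - 1  # each class contributes n_i - 1 degrees of freedom
    return [[value / dof for value in row] for row in total]


# Made-up two-feature data: class_a has no variation in its second feature,
# so its own covariance matrix is singular.
class_a = [(1.0, 2.0), (3.0, 2.0), (5.0, 2.0)]
class_b = [(0.0, 0.0), (2.0, 2.0)]

pooled = pooled_covariance([class_a, class_b])
pooled_det = pooled[0][0] * pooled[1][1] - pooled[0][1] * pooled[1][0]
print(pooled, pooled_det)
```

Because every class contributes to the single pooled matrix, a class with many training samples carries proportionally more weight, and the pooled matrix can remain invertible even when one class's own sample covariance is singular.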

Case 1. I. Versicolor's training sample size varied, other classes’ training sizes
constant.

Figure 18. Misclassification rates for various sample sizes of I. Versicolor.
(Horizontal axis: sample size, 7.5 to 25.)
Table 11. Misclassification rates for various values of $N_2$,
given $N_1 = 20$, $N_3 = 15$.

Disc.    Values of $N_2$
Func.    6       8       10      12      15      20      25
linear   0.0519  0.0497  0.0483  0.0482  0.0451  0.0424  0.0429
quad     0.1783  0.1145  0.0937  0.0898  0.0885  0.0968  0.1080
pred     0.2074  0.1213  0.0937  0.0752  0.0681  0.0609  0.0571

In this case, the predictive function shows more instability at lower sample sizes than
the quadratic function, but at higher sample sizes it converges quickly to the performance
of the linear discriminant function. Again, it is expected that the linear function will
perform best because of the similarity of the covariance matrices of the classes. Notice
that the quadratic function is not converging, but is diverging at the higher sample sizes.
Whether this is a true divergence or just an anomaly due to the variation in sample size is
not the issue; the point is that the quadratic function is not converging to the performance
of the linear discriminant function for these sample sizes.

Case 2. I. Versicolor and I. Virginica training sample sizes varied together, I. Setosa
training size constant.

Figure 19. Misclassification rates for various equal sample
sizes for I. Versicolor and I. Virginica.
Table 12. Misclassification rates for various values of
$N_2 = N_3$, given $N_1 = 20$.

Disc.    Values of $N_2$ and $N_3$
Func.    6       8       10      12      15      20      25
linear   0.0624  0.0581  0.0516  0.0485  0.0451  0.0399  0.0350
quad     0.1783  0.1241  0.1025  0.0973  0.0885  0.0848  0.0758
pred     0.1618  0.1165  0.0871  0.0739  0.0681  0.0602  0.0524

Note that this test gives the quadratic discriminant function a better chance to perform
well, since one of the underlying assumptions of the quadratic function is that of equal
training sample sizes for each class. The two classes that overlap in the feature space
(I. Versicolor and I. Virginica) are the two classes with an equal number of training
samples. Note, however, that the quadratic discriminant function is still outperformed by
the predictive discriminant function.
UNIVARIATE CONCLUSIONS

Predictive  Discriminant  Function  Versus  Quadratic  Discriminant 
Function 

The quadratic discriminant function is asymptotically predictive: as sample sizes go
up, the quadratic function’s misclassification rates approach those of the predictive
discriminant function. At low sample sizes, the quadratic function shows instability in
variance estimates for well-separated populations, and its performance can be worse than
even that of the linear discriminant function for small sample sizes and unequal
variances. A very important point is that, in every univariate simulation made, the
likelihood-based quadratic function has been inferior (i.e., has had higher
misclassification rates) to the predictive discriminant function.
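
One way to see the asymptotic statement is through the standard univariate predictive density under a noninformative prior. The Student t form below is assumed here rather than quoted from this report's own derivation; $\bar{x}$, $s^2$, and $N$ denote a class's sample mean, unbiased sample variance, and training sample size.

```latex
f(x \mid \bar{x}, s^2, N)
  = \frac{\Gamma\!\left(\tfrac{N}{2}\right)}
         {\Gamma\!\left(\tfrac{N-1}{2}\right)\sqrt{(N-1)\pi}\; s\sqrt{1 + \tfrac{1}{N}}}
    \left[ 1 + \frac{(x - \bar{x})^2}{(N-1)\, s^2 \left(1 + \tfrac{1}{N}\right)} \right]^{-N/2}
  \;\longrightarrow\;
    \frac{1}{s\sqrt{2\pi}} \exp\!\left( -\frac{(x - \bar{x})^2}{2 s^2} \right)
  \quad \text{as } N \to \infty .
```

The limit is the plug-in normal density used by the quadratic discriminant function, so the two rules can differ appreciably only when $N$ is small. The factor $1 + 1/N$ and the $N - 1$ degrees of freedom are where the training sample size enters the predictive rule.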

Predictive  Discriminant  Function  Versus  Linear  Discriminant  Function 

The  linear  discriminant  function  outperforms  the  predictive  discriminant  function 
when  covariances  are  near-equal,  because  the  assumption  of  equal  variance  allows  the 
use  of  pooled  variance  for  the  linear  discriminant.  Linear  performance  is  poor  when  the 
underlying  population  variances  are  not  close  in  value.  Note  that  even  when  the  variances 
are  equal,  the  predictive  discriminant  function  performs  nearly  as  well  as  the  linear 
discriminant  function,  especially  when  compared  to  the  performance  of  the  quadratic 
discriminant  function. 

MULTIVARIATE  CONCLUSIONS 

Predictive  Discriminant  Function  Versus  Quadratic  Discriminant 
Function 

In the multivariate case, the predictive discriminant function displays notable
instability when the sample sizes are small. However, the performance of the
quadratic discriminant function is shown to be equally unstable, and the quadratic
function's misclassification rate decreases more slowly as sample size increases. This
instability is understandable given that both the quadratic and predictive
discriminant functions are attempting to estimate a four-dimensional covariance matrix
with a sample size of as few as six observations. It is therefore not surprising that the
error rates for these discriminant functions are quite large with six points. Note that the
misclassification rate of the predictive discriminant function falls quickly as the training
sample size n increases, with n equal to 2 or 3 times the dimension of the observation
vector X sufficing for a fairly stable estimate of the covariance matrix.
Predictive Discriminant Function Versus Linear Discriminant Function

Since  the  data  in  the  multivariate  case  appear  to  have  similar  covariance  structures,  it 
was  expected  that  the  linear  discriminant  function  would  outperform  the  predictive 
discriminant  function  for  the  Fisher  Iris  data.  This  has  been  shown  to  be  true.  The  linear 
discriminant  function  is  able  to  use  the  similarity  of  the  covariance  matrices  to  stabilize 
its  estimate  of  the  covariance  structure  of  the  classes. 

FINAL  STATEMENT 

To conclude, the predictive and linear discriminant functions generally occupy the top
two places in performance. The linear function performs better when variances are similar,
but it performs poorly if the variances of the distributions vary widely. The predictive
discriminant function shows better performance than the quadratic function in almost all
situations, since it takes into account more information (i.e., sample sizes). Therefore,
for classes with small training sample sizes, the predictive discriminant function is
preferable to the quadratic discriminant function in minimizing misclassification rates.

This  work  has  shown  that,  for  small  training  set  classifications,  the  predictive 
discriminant  function  should  not  be  neglected.  The  predictive  discriminant  function  is 
versatile  in  its  application,  minimizes  the  number  of  assumptions  made,  and  performs 
reasonably  well  over  the  range  of  cases  tested. 